Putting Visual Analytics into Practical Use: VAST Challenge 2022, Challenge 3 Economic.
In this take-home exercise, the economic of the city of Engagement, Ohio USA will be revealed by using appropriate static and interactive statistical graphics methods.
With reference to Challenge 3 Question 1 of VAST Challenge 2022, the following questions will be addressed:
Over the period covered by the dataset, which businesses appear to be more prosperous? Which appear to be struggling? Describe your rationale for your answers.
It is observed from the datasets provided by VAST Challenge 2022 that there are three types of businesses present in Engagement, Ohio USA, namely:
In this exercise, the robustness of different types of businesses will be evaluated by different criteria as the data available is different.
Workplaces
For workplaces, data is available on employees, jobs provided, wages, educational level requirement and etc. For restaurants and pubs, data is available on prices, customers’ visits, spending and etc. Therefore, in this exercise, workplaces will be evaluated base on two main criteria:
Restaurants and Pubs
On the other hand, restaurants and pubs will be evaluated based on:
According to the dataset descriptions provided by VAST Challenge, all restaurants have a Prix Fixe food cost for participants to dine in and all pubs have a hourly cost to visit the pub. Therefore, assuming all visits to restaurants are for dinning, restaurants’ revenue will be calculated by number of visits times Prix Fixe food cost. Similarly, pubs’ revenue will be calculated by duration of visits times hourly cost of visits.
Alternative approach of deriving balance difference before and after restaurants and pubs visits as spending is considered but not preferred as balance differences are inconsistent and could be due to unknown reasons.
The following code chunk installs the required R packages and loads them onto RStudio environment.
packages = c('ggiraph', 'plotly', 'DT', 'patchwork',
'gganimate', 'tidyverse','readxl', 'gifski',
'gapminder', 'treemap', 'treemapify', 'rPackedBar',
'trelliscopejs', 'zoo', 'd3treeR', 'ggridges')
for (p in packages){
if(!require(p, character.only = T)){
install.packages(p)
}
library(p,character.only = T)
}
Relevant datasets are imported using read_csv() of readr
package, which is useful for reading delimited files into tibbles.
jobs <- read_csv('rawdata/Jobs.csv')
pubs <- read_csv('rawdata/Pubs.csv')
restaurants <- read_csv('rawdata/Restaurants.csv')
travel <- read_csv('rawdata/TravelJournal.csv')
The following code chunk is used to have an overview of the datasets.
File jobs is cleaned by renaming values for ease of
reading. A new file jobsedu is created using
group_by() to reveal data on jobs offered for different
education requirements.
jobs$educationRequirement <- sub('HighSchoolOrCollege',
'High School or College',
jobs$educationRequirement)
The following code chunk extracts travel records related to
restaurants and pubs using filter() and derives spending of
each visit using inner_join() and
mutate().
Datasets are also cleaned by removing irrelevant columns using
select() and renaming column names using
rename() for ease of understanding.
restaurantstr <- travel %>%
filter(purpose == 'Eating') %>%
mutate(travelTime = travelEndTime - travelStartTime) %>%
select(-c(travelStartTime:travelEndTime, endingBalance)) %>%
inner_join(y= restaurants,
by = c('travelEndLocationId'= 'restaurantId')) %>%
mutate(visitDuration = checkOutTime - checkInTime) %>%
select(-c(purpose, location, checkOutTime)) %>%
rename('restaurantId' = 'travelEndLocationId',
'spending' = 'foodCost')
pubstr <- travel %>%
filter(purpose == 'Recreation (Social Gathering)') %>%
mutate(travelTime = travelEndTime - travelStartTime) %>%
select(-c(travelStartTime: travelEndTime, endingBalance)) %>%
inner_join(y= pubs,
by = c('travelEndLocationId'= 'pubId')) %>%
mutate(visitDuration = checkOutTime - checkInTime,
spending = as.numeric(visitDuration/60)* hourlyCost) %>%
select(-c(purpose, location, checkOutTime)) %>%
rename('pubId' = 'travelEndLocationId')
The following code chunk is used to check for missing values.
The cleaned datasets are saved and read in RDS format to avoid uploading large files to Git.
The following interactive dotplot shows that the size of the workplaces in Engagement ranges from 2 to 9 employees. Therefore those with 9 employees are the biggest businesses in workplace category.
jobsnum <- jobs %>%
group_by(employerId) %>%
summarise(jobNum = n(),
totalPay = sum(hourlyRate),
avgPay = mean(hourlyRate))
tooltip_css <- 'background-color: #008080;
font-stype: bold; color: white'
jobsnum$tooltip <- c(paste0('Employer ID: ', jobsnum$employerId,
'\n Number of Employees: ', jobsnum $jobNum))
p <- ggplot(data = jobsnum, aes(x= jobNum)) +
geom_dotplot_interactive(aes(tooltip = tooltip),
fill = '#bada55',
stackgroups = TRUE,
binwidth = 0.1,
color = NA,
method = 'histodot') +
scale_y_continuous(NULL, breaks = NULL) +
scale_x_continuous(limits = c(1, 10),
breaks = c(1,2,3,4,5,6,7,8,9,10),
labels = c(1,2,3,4,5,6,7,8,9,10)) +
labs(x= 'Number of Employees',
title = "How Many Jobs Is Each Workplace Provding?",
subtitle= 'Economic in Engagement, Ohio',
caption = "Source: VAST Challenge 2022") +
theme(panel.grid.major = element_line(color= 'grey', size = 0.1),
panel.background= element_blank(),
axis.line= element_line(color= 'grey'),
plot.caption = element_text(hjust = 0))
girafe(ggobj = p,
width_svg = 8,
height_svg = 8*0.618,
options = list(opts_tooltip(css = tooltip_css)))
The following code chunk plots a static treemap and converted to an interactive treemap plot based on the total and average hourly employee pay provided by each workplace.

The size of the treemap is based on the total employee hourly pay and that of the color is based on the average employee pay. Therefore, it is observed from the plot that some employers, although of a smaller size, are paying high average rate to their employees. This indicates that these businesses are more prosperous as they can afford higher manpower costs. On the other hand, smaller companies that are also paying low rates for their employees are likely to be struggling, because it indicates that they are either low on funding or they will lose talents to other businesses and cost business development in the long run.
d3tree(tm, rootname = 'Employee Hourly Pay by Workplace')
Next, the average workplace employee wage based on different education levels are examined. Packed bar plots are created and the top 10 employers offering highest average rates are selected to view.
Before plotting, the follow code chunks prepares four different datasets of job data on different education requirements.
jobsedu <- jobs %>%
group_by(employerId, educationRequirement) %>%
summarise(jobnum = n(),
avgHourlyPay = round(mean(hourlyRate),2),
totalHourlyPay = sum(hourlyRate)) %>%
rename('Average Hourly Pay' = 'avgHourlyPay')
jobsedu1 <- filter(jobsedu, educationRequirement=='Low')
jobsedu2 <- filter(jobsedu, educationRequirement=='High School or College')
jobsedu3 <- filter(jobsedu, educationRequirement=='Bachelors')
jobsedu4 <- filter(jobsedu, educationRequirement=='Graduate')
The following code chunks plot four packed bar charts of average hourly pay for jobs of different education requirements. These charts reveal businesses that are paying more for jobs of the same education requirements.
p1 <- plotly_packed_bar(input_data = jobsedu1,
label_column = 'employerId',
value_column = 'Average Hourly Pay',
number_rows = 10,
plot_title = 'Top 10 Workplaces for Low Education - by average hourly pay',
xaxis_label = 'Average Hourly Pay',
hover_label = 'Average Hourly Pay',
min_label_width = 0.001,
color_bar_color = '#66cdaa',
label_color = 'white')
plotly::config(p1, displayModeBar = FALSE)
p2 <- plotly_packed_bar(input_data = jobsedu2,
label_column = 'employerId',
value_column = 'Average Hourly Pay',
number_rows = 10,
plot_title = 'Top 10 Workplaces for Low Education - by average hourly pay',
xaxis_label = 'Average Hourly Pay',
hover_label = 'Average Hourly Pay',
min_label_width = 0.001,
color_bar_color = '#66cdaa',
label_color = 'white')
plotly::config(p2, displayModeBar = FALSE)
p3 <- plotly_packed_bar(input_data = jobsedu3,
label_column = 'employerId',
value_column = 'Average Hourly Pay',
number_rows = 10,
plot_title = 'Top 10 Workplaces for Low Education - by average hourly pay',
xaxis_label = 'Average Hourly Pay',
hover_label = 'Average Hourly Pay',
min_label_width = 0.001,
color_bar_color = '#66cdaa',
label_color = 'white')
plotly::config(p3, displayModeBar = FALSE)
p4 <- plotly_packed_bar(input_data = jobsedu4,
label_column = 'employerId',
value_column = 'Average Hourly Pay',
number_rows = 10,
plot_title = 'Top 10 Workplaces for Low Education - by average hourly pay',
xaxis_label = 'Average Hourly Pay',
hover_label = 'Average Hourly Pay',
min_label_width = 0.001,
color_bar_color = '#66cdaa',
label_color = 'white')
plotly::config(p4, displayModeBar = FALSE)
The following code chunk plots the monthly customer visits bar chart
for each restaurant using treliscopejs. By viewing sort by
visits, revenue or price, we are able to identify the restaurants that
are prospering or struggling. While profit margin information is not
available, as the price of the restaurants do not vary significantly, it
is safe to assume that those with high customer visits and high revenue
are doing well and vice versa.
restaurants <- restaurants %>%
mutate(yearmonth = as.yearmon(checkInTime))
restaurantsv <- restaurants %>%
group_by(restaurantId, yearmonth) %>%
summarise(visits = n(),
revenue = sum(spending),
price = mean(spending))
r <- ggplot(restaurantsv, aes(x= as.factor(yearmonth), y= visits)) +
geom_col(fill= '#008080') +
labs(x= 'Month Year', y= 'Number of\nCustomer\nVisits',
title = 'Monthly Customer Visits by Restaurant') +
facet_trelliscope(~ restaurantId,
nrow = 2, ncol = 2, width = 800,
path = 'trellisr/',
self_contained = TRUE) +
theme(axis.title.y= element_text(angle=0),
axis.ticks.x= element_blank(),
panel.background= element_blank(),
axis.line= element_line(color= 'grey'))
r
Similarly, the monthly customer visits bar chart for pubs are also
plotted using trelliscopejs. By viewing sort by visits,
revenue or price, we are able to identify the pubs that are prospering
or struggling.
pubs <- pubs %>%
mutate(yearmonth = as.yearmon(checkInTime))
pubsv <- pubs %>%
group_by(pubId, yearmonth) %>%
summarise(visits = n(),
revenue = sum(spending),
price = mean(hourlyCost))
pub <- ggplot(pubsv, aes(x= as.factor(yearmonth), y= revenue)) +
geom_col(fill= '#008080') +
labs(x= 'Month Year', y= 'Number of\nCustomer\nVisits',
title = 'Monthly Customer Visits by Pub') +
facet_trelliscope(~ pubId,
nrow = 2, ncol = 2, width = 800,
path = 'trellisp/',
self_contained = TRUE) +
theme(axis.title.y= element_text(angle=0),
axis.ticks.x= element_blank(),
panel.background= element_blank(),
axis.line= element_line(color= 'grey'))
pub